calibration map
Divide et Calibra: Multiclass Local Calibration via Vector Quantization
Barbera, Cesare, Perini, Lorenzo, De Toni, Giovanni, Passerini, Andrea, Pugnana, Andrea
Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.
Instance-Wise Monotonic Calibration by Constrained Transformation
Zhang, Yunrui, Batista, Gustavo, Kanhere, Salil S.
Deep neural networks often produce miscalibrated probability estimates, leading to overconfident predictions. A common approach for calibration is fitting a post-hoc calibration map on unseen validation data that transforms predicted probabilities. A key desirable property of the calibration map is instance-wise monotonicity (i.e., preserving the ranking of probability outputs). However, most existing post-hoc calibration methods do not guarantee monotonicity. Previous monotonic approaches either use an under-parameterized calibration map with limited expressive ability or rely on black-box neural networks, which lack interpretability and robustness. In this paper, we propose a family of novel monotonic post-hoc calibration methods, which employs a constrained calibration map parameterized linearly with respect to the number of classes. Our proposed approach ensures expressiveness, robustness, and interpretability while preserving the relative ordering of the probability output by formulating the proposed calibration map as a constrained optimization problem. Our proposed methods achieve state-of-the-art performance across datasets with different deep neural network models, outperforming existing calibration methods while being data and computation-efficient. Our code is available at https://github.com/YunruiZhang/Calibration-by-Constrained-Transformation
A comprehensive review of classifier probability calibration metrics
Probabilities or confidence values produced by artificial intelligence (AI) and machine learning (ML) models often do not reflect their true accuracy, with some models being under or over confident in their predictions. For example, if a model is 80% sure of an outcome, is it correct 80% of the time? Probability calibration metrics measure the discrepancy between confidence and accuracy, providing an independent assessment of model calibration performance that complements traditional accuracy metrics. Understanding calibration is important when the outputs of multiple systems are combined, for assurance in safety or business-critical contexts, and for building user trust in models. This paper provides a comprehensive review of probability calibration metrics for classifier and object detection models, organising them according to a number of different categorisations to highlight their relationships. We identify 82 major metrics, which can be grouped into four classifier families (point-based, bin-based, kernel or curve-based, and cumulative) and an object detection family. For each metric, we provide equations where available, facilitating implementation and comparison by future researchers.
On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers
Kรคngsepp, Markus, Valk, Kaspar, Kull, Meelis
Every uncalibrated classifier has a corresponding true calibration map that calibrates its confidence. Deviations of this idealistic map from the identity map reveal miscalibration. Such calibration errors can be reduced with many post-hoc calibration methods which fit some family of calibration maps on a validation dataset. In contrast, evaluation of calibration with the expected calibration error (ECE) on the test set does not explicitly involve fitting. However, as we demonstrate, ECE can still be viewed as if fitting a family of functions on the test data. This motivates the fit-on-the-test view on evaluation: first, approximate a calibration map on the test data, and second, quantify its distance from the identity. Exploiting this view allows us to unlock missed opportunities: (1) use the plethora of post-hoc calibration methods for evaluating calibration; (2) tune the number of bins in ECE with cross-validation. Furthermore, we introduce: (3) benchmarking on pseudo-real data where the true calibration map can be estimated very precisely; and (4) novel calibration and evaluation methods using new calibration map families PL and PL3.
Improving calibration by relating focal loss, temperature scaling, and properness
In machine learning classification tasks, achieving high accuracy is only part of the goal; it's equally important for models to express how confident they are in their predictions โ a concept known as model calibration. Well-calibrated models provide probability estimates that closely reflect the true likelihood of outcomes, which is critical in domains like healthcare, finance, and autonomous systems, where decision-making relies on trustworthy predictions. A key factor influencing both the accuracy and calibration of a model is the choice of the loss function during training. The loss function guides the model on how to learn from data by penalizing errors in prediction in a certain way. In this blog post, we will explore how to choose a loss function to achieve good calibration, focusing on the recently proposed focal loss and trying to understand why it leads to quite well-calibrated performance.
Calibrating Expressions of Certainty
Wang, Peiqi, Lam, Barbara D., Liu, Yingcheng, Asgari-Targhi, Ameneh, Panda, Rameswar, Wells, William M., Kapur, Tina, Golland, Polina
We present a novel approach to calibrating linguistic expressions of certainty, e.g., "Maybe" and "Likely". Unlike prior work that assigns a single score to each certainty phrase, we model uncertainty as distributions over the simplex to capture their semantics more accurately. To accommodate this new representation of certainty, we generalize existing measures of miscalibration and introduce a novel post-hoc calibration method. Leveraging these tools, we analyze the calibration of both humans (e.g., radiologists) and computational models (e.g., language models) and provide interpretable suggestions to improve their calibration. Measuring the calibration of humans and computational models is crucial. For example, in healthcare, radiologists express uncertainty in natural language (e.g., "Likely pneumonia") due to the inherent ambiguity in the image they examine. Additionally, it's more natural for large language models (LLMs) to express their confidence using certainty phrases since humans struggle with precise probability estimates (Zhang & Maloney, 2012). Our work enables measuring the calibration of both data annotators and LLMs, paving ways for future work to improve the reliability of LLMs. Existing miscalibration measures focus on classifiers that provide a confidence score, e.g., posterior probability. These approaches cannot be applied directly to text written by humans or language models that communicate uncertainty using natural language. Prior work on "verbalized confidence" attempted to address this by mapping certainty phrases to fixed probabilities, e.g., "High Confidence" equals "90% confident", (Lin et al., 2022a). The oversimplification misses two key aspects: (1) individual semantics: people use phrases like "High Confidence" to indicate a range (e.g., 80-100%) rather than a single value; and (2) population-level variation: different individuals may interpret the same certainty phrase differently. Appendix D explains this gap in more detail. Calibration in the space of certainty phrases presents unique challenges. Prior work such as histogram binning (Zadrozny & Elkan, 2001) and Platt scaling (Platt, 2000) fit low-dimensional functions (e.g., one-dimensional for binary classifiers) to map uncalibrated confidence scores to calibrated probabilities. However, when working with certainty phrases, direct manipulation of the underlying confidence scores is not feasible. In this work, we measure and calibrate both humans and computational models that convey their confidence using natural language expressions of certainty. The key idea is to treat certainty phrases as distributions over the probability simplex.
Improving Calibration by Relating Focal Loss, Temperature Scaling, and Properness
Komisarenko, Viacheslav, Kull, Meelis
Proper losses such as cross-entropy incentivize classifiers to produce class probabilities that are well-calibrated on the training data. Due to the generalization gap, these classifiers tend to become overconfident on the test data, mandating calibration methods such as temperature scaling. The focal loss is not proper, but training with it has been shown to often result in classifiers that are better calibrated on test data. Our first contribution is a simple explanation about why focal loss training often leads to better calibration than cross-entropy training. For this, we prove that focal loss can be decomposed into a confidence-raising transformation and a proper loss. This is why focal loss pushes the model to provide under-confident predictions on the training data, resulting in being better calibrated on the test data, due to the generalization gap. Secondly, we reveal a strong connection between temperature scaling and focal loss through its confidence-raising transformation, which we refer to as the focal calibration map. Thirdly, we propose focal temperature scaling - a new post-hoc calibration method combining focal calibration and temperature scaling. Our experiments on three image classification datasets demonstrate that focal temperature scaling outperforms standard temperature scaling.
Probabilistic Calibration by Design for Neural Network Regression
Dheur, Victor, Taieb, Souhaib Ben
Generating calibrated and sharp neural network predictive distributions for regression problems is essential for optimal decision-making in many real-world applications. To address the miscalibration issue of neural networks, various methods have been proposed to improve calibration, including post-hoc methods that adjust predictions after training and regularization methods that act during training. While post-hoc methods have shown better improvement in calibration compared to regularization methods, the post-hoc step is completely independent of model training. We introduce a novel end-to-end model training procedure called Quantile Recalibration Training, integrating post-hoc calibration directly into the training process without additional parameters. We also present a unified algorithm that includes our method and other post-hoc and regularization methods, as particular cases. We demonstrate the performance of our method in a large-scale experiment involving 57 tabular regression datasets, showcasing improved predictive accuracy while maintaining calibration. We also conduct an ablation study to evaluate the significance of different components within our proposed method, as well as an in-depth analysis of the impact of the base model and different hyperparameters on predictive accuracy.